Final Project: Baseball Over the Last 40 Years

Author

Samuel Harris

Research Question + Variables

  • Research Question: How has baseball changed in the past 40 years, and why has it changed?

  • I think this is interesting because baseball has taken a more statistical turn in the past 20 years, and I believe this should be reflected in the data. Teams started to value players based on different metrics, and I expect the metrics they now prioritize to have risen in league-wide team statistics.

  • The data I’m using is from Lahman’s Baseball Database, available at http://seanlahman.com/download-baseball-database/. I am focusing on a table from Lahman’s Database called “teams”, which contains team statistics such as home runs and strikeouts for each season from 1871 to 2022.

  • Here’s a glance at the summary statistics of the numeric variables I will be focusing on. I chose these variables because they are fundamental to baseball. The data has been cleaned to contain only team statistics from 1980 onward, and each statistic is expressed per game.

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
yearID 0 1 2001.68 12.29 1980.00 1991.00 2002.00 2012.00 2022.00 ▇▆▇▇▇
H 0 1 8.85 0.55 6.50 8.48 8.85 9.22 10.40 ▁▁▇▇▂
HR 0 1 1.00 0.26 0.29 0.81 0.99 1.17 1.97 ▁▇▇▃▁
SO 0 1 6.67 1.24 3.61 5.74 6.52 7.50 10.13 ▁▇▇▅▁
R 0 1 4.53 0.53 3.10 4.16 4.50 4.87 6.23 ▁▆▇▃▁
winpct 0 1 0.50 0.07 0.27 0.45 0.50 0.56 0.72 ▁▅▇▆▁
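
The cleaning step behind this table can be sketched roughly as follows. This is a minimal illustration rather than the code actually used: it assumes Lahman-style column names (G, W, L, H, HR, SO, R), assumes win percentage is computed as W / (W + L), and uses two made-up season rows only to show the arithmetic.

```python
# Sketch of the cleaning step: keep seasons from 1980 onward and
# convert season totals to per-game rates.
# Assumptions: Lahman-style column names; winpct = W / (W + L).

STAT_COLUMNS = ("H", "HR", "SO", "R")


def to_per_game(season):
    """Return per-game rates for one team-season given season totals."""
    games = season["G"]
    rates = {"yearID": season["yearID"]}
    for column in STAT_COLUMNS:
        rates[column] = season[column] / games
    rates["winpct"] = season["W"] / (season["W"] + season["L"])
    return rates


def clean(seasons, first_year=1980):
    """Drop seasons before first_year and normalize the rest per game."""
    return [to_per_game(s) for s in seasons if s["yearID"] >= first_year]


# Hypothetical season totals (not real teams), for illustration only.
example_seasons = [
    {"yearID": 1975, "G": 162, "W": 81, "L": 81,
     "H": 1400, "HR": 120, "SO": 800, "R": 700},
    {"yearID": 1999, "G": 162, "W": 90, "L": 72,
     "H": 1458, "HR": 162, "SO": 1053, "R": 729},
]

cleaned = clean(example_seasons)
print(cleaned)
```

Only the 1999 row survives the filter, and its totals are divided by the 162 games played.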

Variables over Time

  • To see which variables have changed over time, I calculated the mean of each variable grouped by year and collected these league-wide means in a data frame.
  • The outcome variables of interest over time are hits, home runs, and strikeouts.
League-wide Means by Year
yearID H HR SO
1980 9.06 0.73 4.80
1981 8.67 0.64 4.75
1982 8.94 0.80 5.04
1983 8.88 0.78 5.15
1984 8.88 0.77 5.35
1985 8.74 0.86 5.34
1986 8.77 0.91 5.87
1987 9.00 1.06 5.96
1988 8.63 0.76 5.56
1989 8.62 0.73 5.62
1990 8.74 0.79 5.67
1991 8.69 0.80 5.80
1992 8.68 0.72 5.59
1993 9.05 0.89 5.80
1994 9.30 1.03 6.18
1995 9.17 1.01 6.30
1996 9.33 1.09 6.46
1997 9.15 1.02 6.60
1998 9.15 1.04 6.56
1999 9.33 1.14 6.41
2000 9.31 1.17 6.45
2001 9.03 1.12 6.67
2002 8.92 1.04 6.47
2003 9.06 1.07 6.34
2004 9.17 1.12 6.55
2005 9.05 1.03 6.30
2006 9.28 1.11 6.52
2007 9.25 1.02 6.62
2008 9.05 1.00 6.77
2009 8.96 1.04 6.91
2010 8.76 0.95 7.06
2011 8.70 0.94 7.10
2012 8.66 1.02 7.50
2013 8.66 0.96 7.55
2014 8.56 0.86 7.70
2015 8.67 1.01 7.71
2016 8.71 1.15 8.03
2017 8.69 1.26 8.25
2018 8.44 1.15 8.47
2019 8.65 1.40 8.82
2020 8.04 1.28 8.68
2021 8.13 1.22 8.68
2022 8.16 1.07 8.40
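
The grouping step that produces a table like the one above can be sketched as below. The rows are hypothetical per-game team values chosen for easy arithmetic, not entries from the Lahman data, and the function name is my own.

```python
from collections import defaultdict
from statistics import mean


def league_means_by_year(rows, columns=("H", "HR", "SO")):
    """Average each column across all teams within each year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["yearID"]].append(row)
    return {
        year: {c: mean(r[c] for r in group) for c in columns}
        for year, group in sorted(by_year.items())
    }


# Hypothetical per-game team rows, for illustration only.
example_rows = [
    {"yearID": 1980, "H": 9.0, "HR": 0.5, "SO": 4.0},
    {"yearID": 1980, "H": 10.0, "HR": 1.5, "SO": 6.0},
    {"yearID": 1981, "H": 8.0, "HR": 1.0, "SO": 5.0},
]

league_means = league_means_by_year(example_rows)
for year, means in league_means.items():
    print(year, means)
```

Each year ends up with one row holding the unweighted average of the team-level per-game rates.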

Correlation Over Time

  • To visualize correlation between variables, I created a correlation plot for all variables.

  • I then created a table showing the correlation of each variable with year. Strikeouts had the strongest correlation (0.96), rising steadily over the past 40 years. Home runs also rose (0.73), while hits declined modestly (−0.41).
Correlation to Year Table
Variable Cor Pval
SO 0.96 0.00
HR 0.73 0.00
H -0.41 0.01
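
As a rough sketch of how such correlations are computed, the code below applies the Pearson formula to every fifth row of the League-wide Means by Year table. Because it uses only nine of the 43 years, it will not reproduce the values above exactly. The t statistic helper assumes, as a guess on my part, that the reported correlations were computed on the 43 yearly means.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Every fifth row of the league-wide means table: (yearID, H, HR, SO).
league_means = [
    (1980, 9.06, 0.73, 4.80),
    (1985, 8.74, 0.86, 5.34),
    (1990, 8.74, 0.79, 5.67),
    (1995, 9.17, 1.01, 6.30),
    (2000, 9.31, 1.17, 6.45),
    (2005, 9.05, 1.03, 6.30),
    (2010, 8.76, 0.95, 7.06),
    (2015, 8.67, 1.01, 7.71),
    (2020, 8.04, 1.28, 8.68),
]

years = [row[0] for row in league_means]
correlations = {
    name: pearson_r(years, [row[i] for row in league_means])
    for i, name in enumerate(("H", "HR", "SO"), start=1)
}
print(correlations)

# Assumption: n = 43 yearly means behind the reported r = -0.41 for hits.
print(t_statistic(-0.41, 43))
```

Even on this subset the signs match the table: strikeouts and home runs correlate positively with year, and hits negatively. Under the n = 43 assumption, the t statistic for hits is about −2.88, which corresponds to a two-sided p-value between 0.005 and 0.01, consistent with the reported 0.01.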

Scatter Plots

  • These scatter plots visualize each variable’s correlation with time.

  • Again, it’s easy to see that strikeouts have steadily increased over time. But why have strikeouts increased? Aren’t strikeouts a reason why teams lose? Why would teams want to strike out more? This is the question I will be focusing on next.

Strikeouts

  • New question: What variables predict strikeouts?

  • To see which variables predict strikeouts, I performed a regression analysis. A model with yearID, hits, and home runs as predictors of strikeouts had a relatively high R-squared of 0.79.

  • This model also showed a positive relationship between home runs and strikeouts (coefficient 1.196) and a smaller negative relationship between hits and strikeouts (coefficient −0.696), holding the other predictors fixed.

                 (1)
Term             Estimate (Std. Error)
(Intercept)      −118.488 (3.328)
yearID           0.065 (0.002)
H                −0.696 (0.032)
HR               1.196 (0.077)
Num.Obs.         1228
R2               0.792
R2 Adj.          0.791
AIC              2097.3
BIC              2122.8
Log.Lik.         −1043.636
RMSE             0.57
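
For readers curious how a model of this form is fitted, here is a minimal least-squares sketch using the normal equations. It is an illustration under stated assumptions, not the analysis behind the table: it fits nine yearly league-wide means taken from the earlier table rather than the 1,228 team-seasons, and it centers the year at 2000 (my choice, for numerical stability), so its coefficients will differ from those reported.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(X, y):
    """Return least-squares coefficients by solving (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Every fifth row of the league-wide means table: (yearID, H, HR, SO).
league_means = [
    (1980, 9.06, 0.73, 4.80), (1985, 8.74, 0.86, 5.34),
    (1990, 8.74, 0.79, 5.67), (1995, 9.17, 1.01, 6.30),
    (2000, 9.31, 1.17, 6.45), (2005, 9.05, 1.03, 6.30),
    (2010, 8.76, 0.95, 7.06), (2015, 8.67, 1.01, 7.71),
    (2020, 8.04, 1.28, 8.68),
]

# Design matrix columns: intercept, year centered at 2000, H, HR.
X = [[1.0, year - 2000, h, hr] for year, h, hr, _ in league_means]
y = [so for _, _, _, so in league_means]

coefs = fit_ols(X, y)
residuals = [yi - sum(b * x for b, x in zip(coefs, row))
             for row, yi in zip(X, y)]
print(dict(zip(("(Intercept)", "year - 2000", "H", "HR"), coefs)))
```

A quick sanity check on any least-squares fit with an intercept is that the residuals sum to roughly zero and are uncorrelated with each predictor.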

Visualizations

  • These plots show strikeouts against home runs and strikeouts against hits. Win percentage is indicated by point size and team by color, and a slider selects the year.
  • The positive relationship between home runs and strikeouts is clearly seen.
  • The negative relationship between hits and strikeouts is also clearly seen.

Runs

  • So, what’s more valuable for scoring runs: hits or home runs? A model with hits and home runs as predictors of runs had a relatively high R-squared of 0.82.

  • This model also showed a strong positive relationship between home runs and runs (coefficient 1.178) and a weaker positive relationship between hits and runs (coefficient 0.578). The next tab visualizes the relationships.

                 (1)
Term             Estimate (Std. Error)
(Intercept)      −1.762 (0.102)
HR               1.178 (0.025)
H                0.578 (0.012)
Num.Obs.         1228
R2               0.822
R2 Adj.          0.821
AIC              −187.9
BIC              −167.4
Log.Lik.         97.949
RMSE             0.22
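
One way to read these coefficients is to plug values into the fitted equation. The sketch below uses the rounded estimates from the table and the mean hits and home runs per game from the summary statistics; the function and variable names are my own.

```python
# Rounded coefficients copied from the runs model table.
RUNS_MODEL = {"(Intercept)": -1.762, "HR": 1.178, "H": 0.578}


def predict_runs(hits, home_runs, coef=RUNS_MODEL):
    """Predicted runs per game for given hits and home runs per game."""
    return coef["(Intercept)"] + coef["HR"] * home_runs + coef["H"] * hits


# Means from the summary statistics: H = 8.85, HR = 1.00 per game.
predicted_at_means = predict_runs(hits=8.85, home_runs=1.00)

# Change in predicted runs from one more hit or one more home run per game.
extra_from_hit = predict_runs(9.85, 1.00) - predicted_at_means
extra_from_hr = predict_runs(8.85, 2.00) - predicted_at_means

print(predicted_at_means, extra_from_hit, extra_from_hr)
```

The prediction at the means comes out at roughly 4.53 runs per game, in line with the mean of R in the summary table, as expected for a least-squares fit with an intercept. In this model, one additional home run per game is worth about twice as many predicted runs as one additional hit.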

Visualizations

  • The first plot shows a clear linear relationship between home runs and runs.

  • However, the second plot shows a different relationship between hits and runs. Runs stay roughly constant as hits increase until hits reach about 8.5 per game. Beyond 8.5 hits, the relationship looks more linear.

  • So, this analysis suggests that a team must reach a certain number of hits before it begins scoring more runs. It also suggests that more home runs consistently go with more runs, which makes sense because every home run scores at least one run.

Conclusions

  • So, the final conclusion is that strikeouts have increased over time because teams have begun to value home runs more than hits. Teams are willing to strike out more if it means they hit more home runs, which translate to guaranteed runs.

  • These final plots visually show the increase of strikeouts and home runs over time. There is a slider to view a particular team.